Environment Setup Guide
By Hongyu Xiao
Contact: hongyu.xiao@ou.edu
Environment Setup Guide for Research Computing
This guide provides instructions for setting up your research computing environment. Whether you're a new researcher, student, or staff member, proper environment configuration is crucial for efficient computational work.
Purpose and Objectives
The main objectives of this environment setup are:
- Configure a consistent and reproducible computing environment
- Enable access to necessary research software and tools
- Establish proper paths and dependencies for computational tasks
- Ensure security and best practices in research computing
By following this guide, you'll have a fully functional research computing environment that meets your immediate needs and supports future scalability.
Configure your working environment:
- Shell configuration (.bashrc, .bash_profile)
# Example .bashrc configuration
# Custom aliases
alias ll='ls -la'
alias py='python3'
alias jn='jupyter notebook'
# Environment variables
export PATH=$HOME/bin:$PATH
export PYTHONPATH=$HOME/research/lib:$PYTHONPATH
Above is a basic example of a .bashrc configuration file that sets up common research computing environment elements, including aliases and environment variables.
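After editing your .bashrc, reload it so the changes take effect in your current shell session (a standard shell step, not specific to our cluster):
# Apply the updated configuration to the current shell
source ~/.bashrc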
# Example of my research computing aliases
# Jupyter notebook on specific port
alias jupyter_8989="jupyter notebook --no-browser --port=8989"
# Navigate to my research data directory
alias ogs_ourdisk="cd /ourdisk/hpc/ogs/hongyux/dont_archive/"
Note: The /ourdisk/hpc/ogs/ directory is our OGS server storage location. Each user has an automatically created folder under this path following the structure /ourdisk/hpc/ogs/yourusername/dont_archive/. This is where you should store your research data and files.
# Example of accessing your personal storage directory
# Replace 'yourusername' with your actual username
cd /ourdisk/hpc/ogs/yourusername/dont_archive/
# Creating a new directory for a specific project
mkdir /ourdisk/hpc/ogs/yourusername/dont_archive/project_name
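To keep an eye on how much space a project takes up, standard Unix tools work here too (the project directory name is just a placeholder):
# Report the total size of a project directory
du -sh /ourdisk/hpc/ogs/yourusername/dont_archive/project_name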
- Module system overview
Our HPC systems use the module system to manage software environments. To see available software modules, use:
module avail           # List all available modules
module list            # Show currently loaded modules
module load python     # Load Python module
module unload python   # Unload Python module
Common software modules include:
- Python (various versions)
- Compilers (gcc, intel)
- MPI libraries
- Machine learning frameworks (TensorFlow, PyTorch)
Note: Some modules may conflict with each other. Use 'module unload' before loading potentially conflicting modules.
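As a sketch of that unload-then-load pattern (the version strings here are hypothetical; check module avail for the real ones on our system):
# Unload the currently loaded version before switching
module unload python/3.9
# Load the version you actually need
module load python/3.11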
Using SLURM for Efficient Computing
While Jupyter notebook access through tunneling is available, using SLURM for job management often provides better efficiency and resource utilization. Here is my template for a GPU-enabled SLURM script for deep learning tasks:
#!/bin/bash
#SBATCH --partition=disc_dual_a100   # GPU partition
#SBATCH --gres=gpu:1                 # Request 1 GPU
#SBATCH --output=job_%J_.txt         # Output file
#SBATCH --error=job_%J_.txt          # Error file
#SBATCH --ntasks=1                   # Number of tasks
#SBATCH --mem=100G                   # Memory request
#SBATCH --time=24:00:00              # Time limit

# Run your deep learning script
python your_training_script.py
When using GPUs, make sure to specify the appropriate partition (disc_dual_a100) and request GPU resources using the --gres flag. This ensures your job gets scheduled on nodes with available GPUs.
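Before submitting, it can help to check the state of the GPU partition (using the partition name from the script above):
# Show node states and limits for the GPU partition
sinfo -p disc_dual_a100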
To submit your SLURM job, use:
sbatch your_script.sbatch
Common SLURM commands for job management:
squeue -u $USER   # Check your job queue
scancel job_id    # Cancel a specific job
sinfo             # Check partition information
This approach allows for better resource management and more efficient execution of computational tasks compared to interactive notebook sessions.
Here are examples of using squeue and grep to monitor jobs:
# View all jobs in the queue
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
123456 disc_dual python_tr hongyux R 2:30:15 1 node001
123457 disc_dual tensor_jo user2 R 12:45:22 1 node002
123458 disc_dual pytorch_t user3 PD 0:00:00 1 (Resources)
# Filter jobs on disc partitions
$ squeue | grep disc
123456 disc_dual python_tr hongyux R 2:30:15 1 node001
123457 disc_dual tensor_jo user2 R 12:45:22 1 node002
123458 disc_dual pytorch_t user3 PD 0:00:00 1 (Resources)
123459 disc_a100 train_ml user4 R 5:12:33 1 node003
The output shows job ID, partition name, job name, user, status (R=running, PD=pending), runtime, number of nodes, and node assignment or reason for pending.
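For jobs stuck in the pending state, SLURM can also report an estimated start time:
# Ask SLURM for expected start times of your pending jobs
squeue -u $USER --start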
Advanced SLURM Usage Tips
Here are some additional SLURM commands and features that can help you manage your computational jobs more effectively:
1. Job Dependencies
You can make jobs wait for other jobs to complete:
# Wait for job 123456 to complete before starting
sbatch --dependency=afterok:123456 next_job.sh
# Wait for job 123456 to fail before starting
sbatch --dependency=afternotok:123456 cleanup_job.sh
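To chain jobs without copying IDs by hand, you can capture the job ID at submission time; a minimal sketch (the script names preprocess.sh and train.sh are hypothetical):
# --parsable makes sbatch print only the job ID
jobid=$(sbatch --parsable preprocess.sh)
# Start training only if preprocessing completed successfully
sbatch --dependency=afterok:$jobid train.sh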
2. Resource Monitoring
Monitor your job's resource usage:
sstat - View resource usage of running jobs
sacct - View completed job information
# View detailed job information
sacct -j JobID --format=JobID,JobName,MaxRSS,Elapsed
# Monitor memory usage of running job
sstat --format=AveCPU,AveRSS,AveVMSize --jobs JobID
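Many SLURM installations also ship the seff utility for a post-run efficiency summary; if it is available on our cluster, usage looks like this:
# Summarize CPU and memory efficiency of a completed job
seff JobID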
Conda Setup
On OSCER you can load Miniconda to create separate research environments tailored to your needs; for machine learning work in particular, different Python packages may be required. Here is an example of setting up an environment.
# Step 1: Load the Miniconda module
module load Miniconda3
# Step 2: Verify Miniconda Installation
conda --version
# Step 3: Create a New Environment with a Specific Python Version
# Replace X.Y with the desired Python version (e.g., 3.9, 3.10, 3.11)
conda create --name myproject python=X.Y
# Step 4: Activate the New Environment
conda activate myproject
# Step 5: Verify Python Version
python --version
# Optional: Install Specific Packages
conda install numpy pandas matplotlib
# To deactivate the environment when done
conda deactivate
# Useful Additional Commands:
# List all environments
conda env list
# Remove an environment
conda env remove --name myproject
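To make an environment reproducible (one of the objectives at the top of this guide), you can export it and recreate it later or on another machine:
# Export the environment specification to a file
conda env export --name myproject > environment.yml
# Recreate the same environment from that file
conda env create --file environment.yml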
You can also install Conda in a location of your choice. Visit https://docs.conda.io/en/latest/miniconda.html to download the appropriate installer.
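A common pattern for a custom-location install, assuming a Linux x86_64 system (the installer filename may differ for your platform):
# Download the latest Miniconda installer for Linux x86_64
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Install non-interactively (-b) into a custom prefix (-p)
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3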
CUDA/Seisbench Setup
For CUDA, you can use module load to load any module you want, including CUDA. By default, module load CUDA will load the most recent version of CUDA. For example:
[hongyux@schooner3 ~]$ module spider cuda
--------------------------------------------------------------------------------------------
CUDA:
--------------------------------------------------------------------------------------------
Description:
CUDA (formerly Compute Unified Device Architecture) is a parallel computing platform
and programming model created by NVIDIA and implemented by the graphics processing
units (GPUs) that they produce. CUDA gives developers access to the virtual
instruction set and memory of the parallel computational elements in CUDA GPUs.
Versions:
CUDA/5.5.22-GCC-4.8.2
CUDA/7.5.18-GCC-4.9.3-2.25
CUDA/7.5.18
CUDA/8.0.44-GCC-4.9.3-2.25
CUDA/8.0.44-intel-2016a
CUDA/8.0.61_375.26-GCC-5.4.0-2.26
CUDA/9.1.85-GCC-6.4.0-2.28
CUDA/9.2.88
CUDA/10.1.105-GCC-8.2.0-2.31.1
CUDA/10.1.243-GCC-8.3.0
CUDA/11.0.2-GCC-9.3.0
CUDA/11.1.1-GCC-10.2.0
CUDA/11.3.1
CUDA/11.5.0
CUDA/11.7.0
CUDA/11.8.0
CUDA/12.0.0
CUDA/12.1.1
CUDA/12.2.0
CUDA/12.3.0
In this scenario, if you type module load CUDA, you will get the following:
[hongyux@schooner3 ~]$ module load CUDA
[hongyux@schooner3 ~]$ module list
Currently Loaded Modules:
1) binutils/2.38 2) M4/1.4.18 3) flex/2.6.4 4) CUDA/12.3.0
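If your framework requires an older toolkit, you can request a specific version from the list above instead of the default:
# Load a specific CUDA version rather than the newest one
module load CUDA/11.8.0
# Confirm which version is now loaded
module list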
Here is an example of installing SeisBench:
# SeisBench Installation Methods
# 1. Using pip (recommended for most users)
# Install a specific version
pip install seisbench==0.1.0
# Install the latest version
pip install seisbench
# 2. Conda Environment Installation
conda create -n seisbench python=3.9
conda activate seisbench
pip install seisbench
# 3. Additional Dependencies for Full Functionality
pip install torch torchvision torchaudio
pip install numpy pandas matplotlib
Here is an example of showing the installed SeisBench version:
(TL) [hongyux@schooner3 ~]$ pip show seisbench
Name: seisbench
Version: 0.7.0
Summary: The seismological machine learning benchmark collection
Home-page:
Author:
Author-email: Jack Woolam <jack.woollam@kit.edu>, Jannes Münchmeyer <munchmej@gfz-potsdam.de>
License: GPLv3
Location: /home/hongyux/.conda/envs/TL/lib/python3.12/site-packages
Requires: bottleneck, h5py, nest-asyncio, numpy, obspy, pandas, scipy, torch, tqdm
SeisBench also has a good releases page on GitHub; please do take advantage of it: https://github.com/seisbench/seisbench/releases
If your code is not running, first check that your SeisBench version and CUDA version are compatible.
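A quick one-liner to check the relevant versions from Python (torch.version.cuda reports the CUDA toolkit PyTorch was built against):
# Print PyTorch version, its CUDA build, GPU availability, and SeisBench version
python -c "import torch, seisbench; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), seisbench.__version__)"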